Overview

This report details an analysis of the GapMinder data, containing various economic statistics for countries across the world, from 1962 to 2007.

Question 1 - Relationship Between Continents and Energy Use

There was a distinct difference in energy use between continents, as there was a strong link between GDP Per Capita and energy usage (Graph 1). Continents with a lot of developed nations such as Europe and Oceania consistently had higher energy usage through the period 1962-2007. Energy usage also increased for most continents over time.

In the graph below, energy use per continent over time shows several trends. One is the gap between continents made up predominantly of developed nations (Oceania and Europe) and those made up predominantly of developing nations (Africa). The large drop from 1962 to 1972 for the Americas and Asia is due to missing data for developing countries, which was only rectified in the 1972 data. This raises the question of the surprisingly low median values for Asia and the Americas, which include Japan, Canada and the United States. Because these developed economies have much higher energy usage, they are effectively outliers when grouped with their geographical neighbors. GDP is therefore the hidden factor here, which reduces the value of continent as a discriminating variable.

To further understand the impact of GDP on energy use, a sample from a single year (2007) was analyzed. Unsurprisingly, the relationship between energy use and GDP was positive and linear, as demonstrated by the regression line in the graph below. This explains many of the features of energy use by continent shown in Graph 1.
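A linear relationship of this kind can be checked with an ordinary least-squares fit. The sketch below is illustrative only: the data frame `sample2007` and its columns `gdpPercap` and `energyUse` are hypothetical names, and the values are made-up placeholders rather than GapMinder figures.

```r
# Hypothetical toy data standing in for the 2007 sample (NOT real GapMinder values).
sample2007 <- data.frame(
  gdpPercap = c(1000, 5000, 10000, 20000, 30000, 40000),
  energyUse = c(400, 900, 1700, 3100, 4600, 6000)
)

# Fit energy use as a linear function of GDP per capita.
fit <- lm(energyUse ~ gdpPercap, data = sample2007)

# A positive slope means energy use rises with GDP per capita;
# R-squared indicates how well a straight line describes the data.
slope <- coef(fit)[["gdpPercap"]]
r_squared <- summary(fit)$r.squared
print(paste("slope =", signif(slope, 3), " R^2 =", signif(r_squared, 3)))
```

With the real data, the same `lm()` call would be applied to the 2007 rows, and the fitted line is what a regression line on a scatter plot represents.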

Question 2 - Difference between Europe and Asia Imports after 1990

Imports were compared between Europe and Asia from 1992 to 2007. There are four sets of data, spaced 5 years apart. Whilst there is some variation in imports from year to year, the differences between the two continents were not significant.

The analysis process is described in more detail in Appendix 2.

Question 3

The country with the highest mean population density across all years is the Macao Special Administrative Region (SAR), followed by Monaco and Hong Kong SAR.

The top 10 nations are detailed in the table below.

Top 10 Nations By Population Density (per sq km)
country                 meanDensity
Macao SAR, China              14732
Monaco                        14090
Hong Kong SAR, China           5153
Singapore                      4361
Gibraltar                      2622
Bermuda                        1133
Malta                          1084
Bangladesh                      733
Channel Islands                 706
Maldives                        662
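
A ranking like this could be produced by averaging density per country and sorting. The following is a minimal sketch in base R, not the code used for the report: the data frame `gap` and its columns `country`, `Year` and `popDensity` are hypothetical names, and the values are placeholders.

```r
# Hypothetical toy data: three countries, two years each (placeholder values).
gap <- data.frame(
  country    = c("A", "A", "B", "B", "C", "C"),
  Year       = c(2002, 2007, 2002, 2007, 2002, 2007),
  popDensity = c(100, 120, 900, 1100, 50, NA)
)

# Mean density per country across all years.
# The formula interface of aggregate() drops rows with missing values.
meanDensity <- aggregate(popDensity ~ country, data = gap, FUN = mean)
names(meanDensity)[2] <- "meanDensity"

# Sort from densest to least dense and keep at most the first 10 rows.
ranked <- meanDensity[order(-meanDensity$meanDensity), ]
top <- head(ranked, 10)
print(top)
```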

Question 4

TODO

Appendix 1 - Question 1 Analysis

Issue 1: Missing Data in Oldest Records

From 1962 to 1967, the data for both the Americas and Asia only included developed nations such as the USA, Canada and Japan, leading to very high median energy usage. From 1972, energy usage data for less developed nations was added, leading to large drops in median energy usage for Asia and the Americas.

Issue 2: High variability of data by continent

A review of the energy use by continent revealed that several continents contained developed nations from the G12, all of which have much higher energy usage than developing nations. These nations are outliers within their continents, making the median a more robust summary statistic than the mean.
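
The effect of a single high-usage country on the two statistics can be seen in a small example. The numbers below are hypothetical placeholders, not values from the data set.

```r
# Hypothetical energy-use values for one continent: five developing
# nations plus one developed outlier (placeholder numbers).
energyUse <- c(300, 350, 400, 450, 500, 8000)

# The mean is pulled upward by the single outlier.
avg <- mean(energyUse)

# The median stays close to the typical country.
med <- median(energyUse)

print(paste("mean =", round(avg, 1), " median =", med))
```

Here the median sits between the third and fourth values, while the mean lands well above every country except the outlier.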

The graphs below demonstrate the issue of developed and developing countries in Asia and the Americas for a single year.

Appendix 2 - Question 2 Analysis

The data was split by year into imports for Asia and Europe. For each year, tests of variance and normality were carried out. The data sets were found to be heteroscedastic in every year and non-normally distributed in all cases except the European sample for 2007.

Attempts at data transformation did not have an effect on distribution.
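
The report does not state which transformations were tried. As an illustration only, the sketch below assumes a log transform and compares Shapiro-Wilk p-values before and after it; the vector `imports_demo` is a hypothetical placeholder, not real import data.

```r
# Hypothetical right-skewed sample (placeholder values, NOT real import data).
imports_demo <- c(12, 15, 18, 21, 25, 30, 38, 52, 80, 160)

# Shapiro-Wilk p-values before and after a log transform.
# A p-value below 0.05 suggests the sample is not normally distributed.
p_raw <- shapiro.test(imports_demo)$p.value
p_log <- shapiro.test(log(imports_demo))$p.value

print(paste("raw p =", signif(p_raw, 3), " log p =", signif(p_log, 3)))
```

If the transformed p-value stays below 0.05 for the real samples, the transformation has not helped and a non-parametric test remains the safer choice.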

For these reasons, the non-parametric Wilcoxon Rank Sum Test was used to assess if there was a significant difference in imports between European and Asian continents.

Variance Test

The following code was run on a subset of the data (eurasia), which only contained data from Europe and Asia, to test variance across all years. In every year, the Asian and European samples were found to be heteroscedastic.

library(dplyr)  # provides %>%, filter() and select()

# F test of equal variances (Asia vs Europe imports) for each sampled year
for(Y in seq(from=1992, to=2007, by=5)){
  asia <- eurasia %>% filter(continent=="Asia", Year==Y) %>% select(imports)
  europe <- eurasia %>% filter(continent=="Europe", Year==Y)  %>% select(imports)
  bt <- var.test(asia$imports, europe$imports)
  # p < 0.05: reject the hypothesis that the two variances are equal
  if (bt$p.value < 0.05){
    print(paste("Year ", Y, " - Samples Have Different Variance p=",bt$p.value))
  }else{
    print(paste("Year ", Y, " - Samples Have Same Variance p=", bt$p.value))
  }
}
## [1] "Year  1992  - Samples Have Different Variance p= 0.00177440987783983"
## [1] "Year  1997  - Samples Have Different Variance p= 8.81431299264435e-05"
## [1] "Year  2002  - Samples Have Different Variance p= 3.04778763520197e-05"
## [1] "Year  2007  - Samples Have Different Variance p= 0.000354648159186954"

Population Distribution Test

After the comparison of variances, the samples were assessed to see if they had a normal distribution. The following code was used on a subset of the data (eurasia), which only contained data from Europe and Asia. All samples were found to be NOT normally distributed, except the European sample for 2007 (p = 0.076), for which normality could not be rejected.

# Shapiro-Wilk normality test on each continent's imports for each sampled year
for(Y in seq(from=1992, to=2007, by=5)){
  asia <- eurasia %>% filter(continent=="Asia", Year==Y) %>% select(imports)
  europe <- eurasia %>% filter(continent=="Europe", Year==Y)  %>% select(imports)
  shap_asia <- shapiro.test(asia$imports)
  shap_europe <- shapiro.test(europe$imports)

  # p < 0.05: reject the hypothesis that the sample is normally distributed
  if (shap_asia$p.value < 0.05){
    print(paste("Year ", Y, " - Asia Sample is not Normal Dist p=", shap_asia$p.value))
  }else{
    print(paste("Year ", Y, " - Asia sample has Normal Dist p=", shap_asia$p.value))
  }

  if (shap_europe$p.value < 0.05){
    print(paste("Year ", Y, " - Europe sample is Not Normal Dist p=", shap_europe$p.value))
  }else{
    print(paste("Year ", Y, " - Europe sample has Normal Dist p=", shap_europe$p.value))
  }
}
## [1] "Year  1992  - Asia Sample is not Normal Dist p= 0.00692754055163552"
## [1] "Year  1992  - Europe sample is Not Normal Dist p= 0.00159530937109208"
## [1] "Year  1997  - Asia Sample is not Normal Dist p= 0.0014279242621728"
## [1] "Year  1997  - Europe sample is Not Normal Dist p= 0.0185145480399587"
## [1] "Year  2002  - Asia Sample is not Normal Dist p= 0.000865516067436733"
## [1] "Year  2002  - Europe sample is Not Normal Dist p= 0.0270614166355554"
## [1] "Year  2007  - Asia Sample is not Normal Dist p= 0.000526159274481657"
## [1] "Year  2007  - Europe sample has Normal Dist p= 0.0757556476330589"

Sample Test

Since the data is heteroscedastic and largely non-normal, the Wilcoxon Rank Sum test was used to confirm or exclude any significant difference in imports for each year of data collected.

The following code was used on a subset of the data (eurasia), which only contained data from Europe and Asia.

The result for all years was that there was NO significant difference in imports between Asia and Europe.

# Wilcoxon Rank Sum test (Asia vs Europe imports) for each sampled year
for(Y in seq(from=1992, to=2007, by=5)) {
  asia <- eurasia %>% filter(continent=="Asia", Year==Y) %>% select(imports)
  europe <- eurasia %>% filter(continent=="Europe", Year==Y)  %>% select(imports)
  wt <- wilcox.test(asia$imports, europe$imports)
  # p >= 0.05: no significant difference between the two samples
  print(paste("Year ", Y, " - Wilcoxon Rank Sum. p=",wt$p.value))
}
## [1] "Year  1992  - Wilcoxon Rank Sum. p= 0.481477146310066"
## [1] "Year  1997  - Wilcoxon Rank Sum. p= 0.334321291832349"
## [1] "Year  2002  - Wilcoxon Rank Sum. p= 0.869442490313876"
## [1] "Year  2007  - Wilcoxon Rank Sum. p= 0.406469388988677"
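
The per-year results can also be collected into a single data frame, which makes the comparison against the 0.05 threshold easier to scan. This is a sketch in base R, not the code used for the report: `eurasia_demo` is a synthetic stand-in with the same assumed columns (`continent`, `Year`, `imports`) and randomly generated values.

```r
# Synthetic stand-in for `eurasia` (random placeholder values, NOT real data).
set.seed(1)
years <- seq(from = 1992, to = 2007, by = 5)
eurasia_demo <- expand.grid(id = 1:15,
                            continent = c("Asia", "Europe"),
                            Year = years)
eurasia_demo$imports <- runif(nrow(eurasia_demo), min = 10, max = 90)

# Run the Wilcoxon Rank Sum test once per year and keep one row per year.
results <- do.call(rbind, lapply(years, function(y) {
  in_year <- eurasia_demo$Year == y
  asia    <- eurasia_demo$imports[in_year & eurasia_demo$continent == "Asia"]
  europe  <- eurasia_demo$imports[in_year & eurasia_demo$continent == "Europe"]
  wt <- wilcox.test(asia, europe)
  data.frame(Year = y, p.value = wt$p.value, significant = wt$p.value < 0.05)
}))
print(results)
```

Returning a data frame rather than printing inside the loop keeps the p-values available for later steps, such as a summary table.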